
    Compilation Techniques for High-Performance Embedded Systems with Multiple Processors

    Despite the progress made in developing more advanced compilers for embedded systems, programming of embedded high-performance computing systems based on Digital Signal Processors (DSPs) is still a highly skilled manual task. This is true for single-processor systems, and even more so for embedded systems based on multiple DSPs. Compilers often fail to optimise existing DSP codes written in C due to the employed programming style. Parallelisation is hampered by the complex multiple-address-space memory architecture found in most commercial multi-DSP configurations. This thesis develops an integrated optimisation and parallelisation strategy that can deal with low-level C codes and produces optimised parallel code for a homogeneous multi-DSP architecture with distributed physical memory and multiple logical address spaces. In a first step, low-level programming idioms are identified and recovered. This enables the application of high-level code and data transformations well known in the field of scientific computing. Iterative, feedback-driven search for “good” transformation sequences is also investigated. A novel approach to parallelisation based on a unified data and loop transformation framework is presented and evaluated. Performance optimisation is achieved through exploitation of data locality on the one hand, and utilisation of DSP-specific architectural features such as Direct Memory Access (DMA) transfers on the other. The proposed methodology is evaluated against two benchmark suites (DSPstone and UTDSP) and four different high-performance DSPs, one of which is part of a commercial four-processor multi-DSP board also used for evaluation. Experiments confirm the effectiveness of the program recovery techniques as enablers of high-level transformations and automatic parallelisation. Source-to-source transformations of DSP codes yield an average speedup of 2.21 across the four DSP architectures. The parallelisation scheme is, in conjunction with a set of locality optimisations, able to produce linear and even super-linear speedups on a number of relevant DSP kernels and applications.
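
    The DMA exploitation mentioned above typically takes the form of double-buffered streaming between slow external memory and fast on-chip memory. The C sketch below illustrates this common idiom only; dma_start and dma_wait are hypothetical stand-ins for vendor-specific DMA intrinsics (modelled here as a synchronous copy so the sketch compiles), not code from the thesis:

        #include <string.h>

        #define TILE 256

        /* Hypothetical stand-ins for vendor DMA intrinsics, modelled as a
         * synchronous copy; real DMA runs asynchronously while the core
         * computes on the other buffer. */
        static void dma_start(float *dst, const float *src, int n)
        {
            memcpy(dst, src, n * sizeof *dst);
        }
        static void dma_wait(void) { /* real code blocks until the transfer completes */ }

        /* Stream a large array from external memory through two small on-chip
         * buffers: while tile t is being processed, tile t+1 is already in flight. */
        void process_stream(const float *ext_in, float *ext_out, int n_tiles)
        {
            static float buf[2][TILE];            /* on-chip double buffer */
            int cur = 0;

            dma_start(buf[cur], ext_in, TILE);    /* prefetch the first tile */
            for (int t = 0; t < n_tiles; t++) {
                dma_wait();                       /* wait until tile t has arrived */
                if (t + 1 < n_tiles)              /* overlap: start fetching tile t+1 */
                    dma_start(buf[1 - cur], ext_in + (t + 1) * TILE, TILE);
                for (int i = 0; i < TILE; i++)    /* compute on tile t */
                    ext_out[t * TILE + i] = 2.0f * buf[cur][i];
                cur = 1 - cur;                    /* swap buffers */
            }
        }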

    High spatial resolution measurement of oxygen consumption rates in permeable sediments

    A method is presented for the measurement of depth profiles of volumetric oxygen consumption rates in permeable sediments with high spatial resolution. When combined with in situ oxygen microprofiles measured by microsensors, areal rates of aerobic respiration in sediments can be calculated. The method is useful for characterizing sediments exposed to highly dynamic advective water exchange, such as intertidal sandy sediments. The method is based on percolating the sediment in a sampling core with aerated water and monitoring oxygen in the sediment using either an oxygen microelectrode or a planar oxygen optode. The oxygen consumption rates are determined using three approaches: (1) as the initial rate of oxygen decrease measured at discrete points after the percolation is stopped, (2) from oxygen microprofiles measured sequentially after the percolation is stopped, and (3) as a derivative of steady-state oxygen microprofiles measured during a constant percolation of the sediment. The spatial resolution of a typical 3 to 4 cm profile within a measurement time of 1 to 2 h is better with planar optodes (~0.3 mm) than with microelectrodes (2 to 5 mm), whereas the precision of oxygen consumption rate measurements at individual points is similar (0.1 to 0.5 µmol L−1 min−1) for both sensing methods. The method is consistent with the established methods (interfacial gradients combined with Fick’s law of diffusion, benthic chambers) when tested on the same sediment sample under identical, diffusion-controlled conditions.
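
    For context, the established interfacial-gradient method mentioned in the final sentence derives the areal oxygen uptake from the oxygen gradient at the sediment-water interface using Fick's first law of diffusion. The notation below is the standard textbook form, not reproduced from the paper:

        J = -\varphi \, D_s \left. \frac{dC}{dz} \right|_{z=0}

    where J is the diffusive oxygen uptake per unit area, \varphi the sediment porosity, D_s the sediment diffusion coefficient of oxygen, C the oxygen concentration and z the depth below the interface.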

    Towards Automatic Parallelisation for Multi-Processor DSPs

    This paper describes a preliminary compiler-based approach to achieving high-performance DSP applications by automatically mapping C programs to multi-processor DSP systems. DSP programs typically contain pointer-based memory accesses, making automatic parallelisation difficult. This paper presents a new method to convert a restricted class of pointer-based memory accesses into array accesses with explicit index functions suitable for parallelisation. Different parallelisation approaches suitable for multi-processor DSPs are considered. We implemented our pointer conversion algorithm in the prototype Octave compiler, where experimental results demonstrate that our technique increases the number of parallelisable loops from 6 to 24 for 11 of the DSPstone benchmarks. Furthermore, our technique is shown to also improve the performance of DSP codes on single-processor systems, decreasing execution time by up to 33%.
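
    As a rough illustration of the conversion described (invented code, not taken from the paper), consider a DSPstone-style kernel written with post-increment pointer walks and its equivalent with explicit index functions, which dependence analysis and parallelisation handle far more easily:

        /* Original style: pointer walks obscure the access pattern. */
        void dotp_ptr(const int *a, const int *b, int *out, int n)
        {
            const int *p = a, *q = b;
            int sum = 0;
            for (int i = 0; i < n; i++)
                sum += *p++ * *q++;
            *out = sum;
        }

        /* After conversion: explicit index functions expose the affine
         * accesses a[i] and b[i] to the parallelising compiler. */
        void dotp_idx(const int *a, const int *b, int *out, int n)
        {
            int sum = 0;
            for (int i = 0; i < n; i++)
                sum += a[i] * b[i];
            *out = sum;
        }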

    Profile-driven parallelisation of sequential programs

    Traditional parallelism detection in compilers is performed by means of static analysis, more specifically data and control dependence analysis. The information that is available at compile time, however, is inherently limited and therefore restricts the parallelisation opportunities. Furthermore, applications written in C, which represent the majority of today’s scientific, embedded and system software, utilise many low-level features and an intricate programming style that forces the compiler to make even more conservative assumptions. Despite numerous proposals to handle this uncertainty at compile time using speculative optimisation and parallelisation, the software industry still lacks pragmatic approaches that extract coarse-grain parallelism to exploit the multiple processing units of modern commodity hardware. This thesis introduces a novel approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C. We utilise profiling information to overcome the limitations of static data and control-flow analysis, enabling more aggressive parallelisation. Profiling is performed using an instrumentation scheme operating at the Intermediate Representation (IR) level of the compiler. In contrast to existing approaches that depend on low-level binary tools and debugging information, IR profiling provides precise and direct correlation of profiling information back to the IR structures of the compiler. Additionally, our approach is orthogonal to existing automatic parallelisation approaches, and additional fine-grain parallelism may be exploited. We demonstrate the applicability and versatility of the proposed methodology using two studies that target different forms of parallelism. First, we focus on the exploitation of loop-level parallelism that is abundant in many scientific and embedded applications. We evaluate our parallelisation strategy against the NAS and SPEC FP benchmarks and two different multi-core platforms (a shared-memory Intel Xeon SMP and a heterogeneous distributed-memory IBM Cell blade). Empirical evaluation shows that our approach not only yields significant improvements when compared with state-of-the-art parallelising compilers, but comes close to, and sometimes exceeds, the performance of manually parallelised codes. On average, our methodology achieves 96% of the performance of the hand-tuned parallel benchmarks on the Intel Xeon platform, and a significant speedup for the Cell platform. The second study addresses the problem of partially sequential loops, typically found in implementations of multimedia codecs. We develop a more powerful whole-program representation based on the Program Dependence Graph (PDG) that supports profiling, partitioning and code generation for pipeline parallelism. In addition, we demonstrate how this enhances conventional pipeline parallelisation by incorporating support for multi-level loops and pipeline stage replication in a uniform and automatic way. Experimental results using a set of complex multimedia and stream processing benchmarks confirm the effectiveness of the proposed methodology, which yields speedups of up to 4.7 on an eight-core Intel Xeon machine.
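
    To make the IR-level dependence profiling concrete, the C sketch below shows the kind of instrumentation a compiler could insert around every load and store of a candidate loop, recording which iteration last wrote each address so that cross-iteration dependences surface at run time. The record_read/record_write helpers and the hashed shadow table are illustrative inventions, not the thesis's actual scheme:

        #include <stdio.h>

        #define TABLE_SIZE 65536

        /* Shadow table: last iteration that wrote each (hashed) address.
         * Hashing may alias distinct addresses, so the check is conservative. */
        static long last_writer[TABLE_SIZE];
        static int  cross_iter_dep = 0;

        static unsigned hash_addr(const void *p)
        {
            return (unsigned)((unsigned long)p >> 2) & (TABLE_SIZE - 1);
        }

        /* Hooks a compiler would insert at IR level inside the candidate loop. */
        static void record_write(const void *p, long iter)
        {
            last_writer[hash_addr(p)] = iter;
        }

        static void record_read(const void *p, long iter)
        {
            long w = last_writer[hash_addr(p)];
            if (w >= 0 && w != iter)    /* a value flows between iterations */
                cross_iter_dep = 1;
        }

        int main(void)
        {
            int a[100];
            for (long j = 0; j < TABLE_SIZE; j++) last_writer[j] = -1;

            a[0] = 1;
            for (long i = 1; i < 100; i++) {    /* candidate loop */
                record_read(&a[i - 1], i);
                record_write(&a[i], i);
                a[i] = a[i - 1] + 1;            /* genuine cross-iteration
                                                   dependence: not parallel */
            }
            printf(cross_iter_dep ? "not parallel\n" : "probably parallel\n");
            return 0;
        }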

    Using machine-learning to efficiently explore the architecture/compiler co-design space

    Designing new microprocessors is a time-consuming task. Architects rely on slow simulators to evaluate performance, and a significant proportion of the design space has to be explored before an implementation is chosen. This process becomes even more time consuming when compiler optimisations are also considered. Once the architecture is selected, a new compiler must be developed and tuned. What is needed are techniques that can speed up this whole process and develop a new optimising compiler automatically. This thesis proposes the use of machine-learning techniques to address architecture/compiler co-design. First, two performance models are developed and used to efficiently search the design space of a microarchitecture. These models accurately predict performance metrics such as cycles or energy, or a trade-off of the two. The first model uses just 32 simulations to model the entire design space of new applications, an order of magnitude fewer than state-of-the-art techniques. The second model addresses offline training costs and predicts the average behaviour of a complete benchmark suite. Compared to the state of the art, it needs five times fewer training simulations when applied to the SPEC CPU 2000 and MiBench benchmark suites. Next, the impact of compiler optimisations on the design process is considered. This has the potential to change the shape of the design space and improve performance significantly. A new model is proposed that predicts the performance obtainable by an optimising compiler for any design point, without having to build the compiler. Compared to the state of the art, this model achieves a significantly lower error rate. Finally, a new machine-learning optimising compiler is presented that predicts the best compiler optimisation setting for any new program on any new microarchitecture. It achieves an average speedup of 1.14x over the default best gcc optimisation level. This represents 61% of the maximum speedup available, using just one profile run of the application.
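
    The thesis's models are more sophisticated than anything that fits here, but the underlying idea (fit a cheap predictor to a handful of simulated design points, then query the predictor instead of the simulator) can be sketched as a least-squares linear model over microarchitectural parameters. Every name below is hypothetical, and features and targets are assumed pre-normalised to comparable scales:

        #define N_FEATURES 4    /* e.g. issue width, cache size, ... (normalised) */
        #define N_TRAIN    32   /* a small budget of simulated design points */

        /* Predict a performance metric (e.g. cycles) for one design point. */
        double predict(const double w[N_FEATURES + 1], const double x[N_FEATURES])
        {
            double y = w[N_FEATURES];               /* bias term */
            for (int f = 0; f < N_FEATURES; f++)
                y += w[f] * x[f];
            return y;
        }

        /* Fit the weights to the simulated samples by gradient descent on
         * the squared prediction error. */
        void fit(double w[N_FEATURES + 1],
                 const double x[N_TRAIN][N_FEATURES], const double y[N_TRAIN])
        {
            for (int f = 0; f <= N_FEATURES; f++) w[f] = 0.0;
            for (int epoch = 0; epoch < 10000; epoch++) {
                for (int s = 0; s < N_TRAIN; s++) {
                    double err = predict(w, x[s]) - y[s];
                    for (int f = 0; f < N_FEATURES; f++)
                        w[f] -= 1e-3 * err * x[s][f];
                    w[N_FEATURES] -= 1e-3 * err;    /* bias update */
                }
            }
        }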

    Compiler-driven data layout transformations for network applications

    This work approaches the little-studied topic of compiler optimisations directed at network applications. It starts by investigating whether there exist any fundamental differences between application domains that justify the development and tuning of domain-specific compiler optimisations. It presents an automated approach that is capable of identifying domain-specific workload characterisations and presenting them in a readily interpretable format based on decision trees. The generated workload profiles summarise key resource utilisation issues and enable compiler engineers to address the highlighted bottlenecks. By applying this methodology to data-intensive network infrastructure applications, it shows that data organisation is the key obstacle to overcome in order to achieve high performance. It therefore proposes and evaluates three specialised data transformations (structure splitting, array regrouping, and software caching) against the industrial EEMBC networking benchmarks and real-world data sets. It demonstrates, on the one hand, that speedups of up to 2.62 can be achieved, but, on the other, that no single solution performs equally well across different network traffic scenarios. To address this issue, an adaptive software caching scheme for high-frequency route lookup operations is introduced and evaluated, again against the EEMBC networking benchmarks and real-world data sets, achieving speedups of up to 3.30 and 2.27. The results clearly demonstrate that adaptive data organisation schemes are necessary to ensure optimal performance under varying network loads. Finally, this research addresses another issue introduced by data transformations such as array regrouping and software caching: the need for static analysis to allow efficient resource allocation. This thesis proposes a static code analyser that allows the automatic resource analysis of source code containing list and tree structures. The tool applies a combination of amortised analysis and separation logic methodology to real code and is able to evaluate the type and resource usage of existing data structures, which can be used to compute global resource consumption values for full data-intensive network applications.
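
    Of the three transformations evaluated, structure splitting is the simplest to illustrate. The fragment below is a generic example (not the thesis's code or the EEMBC sources): the hot fields consulted on every route lookup are split away from cold bookkeeping fields, so many more lookup keys fit in each cache line:

        /* Before: array of structures; each element mixes the route-lookup
         * key (hot) with rarely used bookkeeping fields (cold). */
        struct entry {
            unsigned int  dest_addr;     /* hot: read on every lookup */
            unsigned int  next_hop;      /* hot: read on every lookup */
            char          name[48];      /* cold: used only for diagnostics */
            unsigned long last_updated;  /* cold */
        };

        /* After structure splitting: hot fields from many entries now share
         * cache lines, so the lookup loop touches far less memory. */
        struct entry_hot  { unsigned int dest_addr, next_hop; };
        struct entry_cold { char name[48]; unsigned long last_updated; };

        unsigned int lookup(const struct entry_hot *hot, int n, unsigned int addr)
        {
            for (int i = 0; i < n; i++)   /* scans only the dense hot array */
                if (hot[i].dest_addr == addr)
                    return hot[i].next_hop;
            return 0;                     /* illustrative "no route" value */
        }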

    Increasing the efficacy of automated instruction set extension

    The use of Instruction Set Extension (ISE) in customising embedded processors for a specific application has been studied extensively in recent years. The addition of a set of complex arithmetic instructions to a baseline core has proven to be a cost-effective means of meeting design performance requirements. This thesis proposes and evaluates a reconfigurable ISE implementation called “Configurable Flow Accelerators” (CFAs), a number of refinements to an existing Automated ISE (AISE) algorithm called “ISEGEN”, and the effects of source form on AISE. The CFA is demonstrated repeatedly to be a cost-effective design for ISE implementation. A temporal partitioning algorithm called “staggering” is proposed and demonstrated to reduce the area of a CFA implementation by 37% on average for only an 8% reduction in acceleration. The thesis then turns to concerns within the ISEGEN AISE algorithm. A methodology for finding a good static heuristic weighting vector for ISEGEN is proposed and demonstrated; up to 100% of merit is shown to be lost or gained through the choice of vector. ISEGEN early termination is introduced and shown to improve the runtime of the algorithm by up to 7.26x, and 5.82x on average. An extension to the ISEGEN heuristic to account for pipelining is proposed and evaluated, increasing acceleration by up to an additional 1.5x. An energy-aware heuristic is added to ISEGEN, which reduces the energy used by a CFA implementation of a set of ISEs by an average of 1.6x, and up to 3.6x. This result directly contradicts the frequently espoused notion that “bigger is better” in ISE. The last part of this thesis is concerned with source-level transformation: the effect of changing the representation of the application on the quality of the combined hardware-software solution. A methodology for combined exploration of source transformation and ISE is presented and demonstrated to improve the acceleration of the result by an average of 35% versus ISE alone. Floating point is demonstrated to perform worse than fixed point, for all design concerns and applications studied here, regardless of the ISEs employed.
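
    As background, ISE replaces a recurring dataflow subgraph with a single custom instruction. In the invented C fragment below, the saturating multiply-accumulate pattern would be collapsed into one custom instruction exposed to source code as an intrinsic; custom_mac_sat is a hypothetical name, not an instruction from the thesis:

        #include <stdint.h>

        /* Recurring pattern in the application: a multiply-accumulate with
         * saturation, appearing in many inner loops. */
        static int32_t mac_sat(int32_t acc, int16_t a, int16_t b)
        {
            int64_t r = (int64_t)acc + (int64_t)a * b;
            if (r > INT32_MAX) r = INT32_MAX;   /* saturate instead of wrapping */
            if (r < INT32_MIN) r = INT32_MIN;
            return (int32_t)r;
        }

        /* After automated ISE: the whole subgraph above becomes one custom
         * instruction, exposed as a compiler intrinsic.  The name is
         * hypothetical; a real toolchain generates both the intrinsic and
         * the matching functional unit. */
        int32_t custom_mac_sat(int32_t acc, int16_t a, int16_t b);

        int32_t dot16(const int16_t *x, const int16_t *y, int n)
        {
            int32_t acc = 0;
            for (int i = 0; i < n; i++)
                acc = mac_sat(acc, x[i], y[i]);  /* or: custom_mac_sat(...) */
            return acc;
        }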

    Integrating profile-driven parallelism detection and machine-learning based mapping

    Compiler-based auto-parallelization is a much-studied area but has yet to find widespread application. This is largely due to the poor identification and exploitation of application parallelism, resulting in disappointing performance far below that which a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach resulting in significant performance improvements of the generated parallel code. Using profile-driven parallelism detection, we overcome the limitations of static analysis, enabling the identification of more application parallelism, and only rely on the user for final approval. We then replace the traditional target-specific and inflexible mapping heuristics with a machine-learning-based prediction mechanism, resulting in better mapping decisions while automating adaptation to different target architectures. We have evaluated our parallelization strategy on the NAS and SPEC CPU2000 benchmarks and two different multicore platforms (dual quad-core Intel Xeon SMP and dual-socket QS20 Cell blade). We demonstrate that our approach not only yields significant improvements when compared with state-of-the-art parallelizing compilers but also comes close to and sometimes exceeds the performance of manually parallelized codes. On average, our methodology achieves 96% of the performance of the hand-tuned OpenMP NAS and SPEC parallel benchmarks on the Intel Xeon platform and gains a significant speedup for the IBM Cell platform, demonstrating the potential of profile-guided and machine-learning-based parallelization for complex multicore platforms.
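
    A simplified view of the mapping stage is a profitability decision over profiled loop features. In the C/OpenMP sketch below the threshold test is a hand-written stand-in for the paper's trained machine-learning predictor, and the feature set and constant are invented:

        #include <omp.h>

        /* Features of a candidate loop, collected on a profiling run. */
        struct loop_features {
            long   iterations;       /* observed trip count */
            double work_per_iter;    /* e.g. dynamic instructions per iteration */
        };

        /* Stand-in for the learned predictor: a trained model would replace
         * this threshold test with a decision learned from benchmark data. */
        static int profitable_to_parallelise(struct loop_features f, int n_cores)
        {
            return f.iterations * f.work_per_iter > 50000.0 * n_cores;
        }

        void scale(float *a, long n, float s, struct loop_features f)
        {
            if (profitable_to_parallelise(f, omp_get_max_threads())) {
                #pragma omp parallel for
                for (long i = 0; i < n; i++)
                    a[i] *= s;
            } else {
                for (long i = 0; i < n; i++)   /* too little work: stay sequential */
                    a[i] *= s;
            }
        }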

    Exploitation of GPUs for the parallelisation of probably parallel legacy code

    General-purpose GPUs provide massive compute power but are notoriously difficult to program. In this paper we present a complete compilation strategy to exploit GPUs for the parallelisation of sequential legacy code. Using hybrid data dependence analysis combining static and dynamic information, our compiler automatically detects suitable parallelism and generates parallel OpenCL code from sequential programs. We exploit the fact that dependence profiling provides us with parallel loop candidates that are highly likely to be genuinely parallel, but cannot be statically proven so. For the efficient GPU parallelisation of those probably parallel loop candidates, we propose a novel software speculation scheme, which ensures correctness for the unlikely, yet possible, case of dynamically detected dependence violations. Our scheme operates in place and supports speculative read and write operations. We demonstrate the effectiveness of our approach in detecting and exploiting parallelism using sequential codes from the NAS benchmark suite. We achieve an average speedup of 3.2x, and up to 99x, over the sequential baseline. On average, this is 1.42 times faster than state-of-the-art speculation schemes and corresponds to 99% of the performance level of a manual GPU implementation developed by independent expert programmers.
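
    The speculation scheme itself is GPU/OpenCL-specific, but its execute/validate/roll-back structure can be sketched on the CPU with OpenMP. Everything below (the loop body, the access pattern, and the one-stamp-per-element logging, which can miss rare conflicts) is a simplified invention, not the paper's algorithm:

        #include <stdlib.h>
        #include <string.h>

        static long *read_by, *written_by;  /* last iteration to touch each element */

        static float spec_load(const float *a, long j, long iter)
        {
            read_by[j] = iter;              /* log the read; one stamp per element
                                               (a real scheme keeps full access sets) */
            return a[j];
        }

        static void spec_store(float *a, long j, long iter, float v)
        {
            written_by[j] = iter;           /* log the write */
            a[j] = v;
        }

        void spec_run(float *a, long n)
        {
            float *undo = malloc(n * sizeof *undo);     /* for roll-back */
            memcpy(undo, a, n * sizeof *undo);
            read_by    = malloc(n * sizeof *read_by);
            written_by = malloc(n * sizeof *written_by);
            for (long j = 0; j < n; j++) read_by[j] = written_by[j] = -1;

            #pragma omp parallel for                    /* speculative, in place */
            for (long i = 0; i < n; i++)                /* (7*i)%n: arbitrary pattern */
                spec_store(a, i, i, spec_load(a, (7 * i) % n, i) + 1.0f);

            int violated = 0;                           /* post-pass validation */
            for (long j = 0; j < n; j++)
                if (read_by[j] >= 0 && written_by[j] >= 0 && read_by[j] != written_by[j])
                    violated = 1;                       /* cross-iteration conflict */

            if (violated) {                             /* mis-speculation: restore
                                                           and re-execute sequentially */
                memcpy(a, undo, n * sizeof *a);
                for (long i = 0; i < n; i++)
                    a[i] = a[(7 * i) % n] + 1.0f;
            }
            free(undo); free(read_by); free(written_by);
        }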